first-person perspective
Assessing Consciousness-Related Behaviors in Large Language Models Using the Maze Test
Pimenta, Rui A., Schlippe, Tim, Schaaff, Kristina
We investigate consciousness-like behaviors in Large Language Models (LLMs) using the Maze Test, challenging models to navigate mazes from a first-person perspective. After synthesizing consciousness theories into 13 essential characteristics, we evaluated 12 leading LLMs across zero-shot, one-shot, and few-shot learning scenarios. Results showed reasoning-capable LLMs consistently outperforming standard versions, with Gemini 2.0 Pro achieving 52.9% Complete Path Accuracy and DeepSeek-R1 reaching 80.5% Partial Path Accuracy. The gap between these metrics indicates that LLMs struggle to maintain a coherent self-model throughout a solution, a fundamental aspect of consciousness. While LLMs show progress in consciousness-related behaviors through reasoning mechanisms, they lack the integrated, persistent self-awareness characteristic of consciousness.
The emergence of human-like capabilities in AI has been debated since the field's inception in the 1950s [1], [2]. An early case was ELIZA [3], a chatbot simulating a therapist. Though based on pattern matching, its responses were so convincing that Weizenbaum's secretary requested privacy for a "real conversation", showing how humans can mistakenly perceive consciousness in even the simplest AI systems.
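As a rough, assumption-laden illustration of the two path metrics quoted above, the sketch below scores a predicted move sequence against a reference solution: a complete-path score of 1.0 only if every move matches, and a partial-path score equal to the fraction of reference moves reproduced before the first mistake. The move vocabulary and exact definitions are hypothetical and may differ from the paper's.

```python
# Hypothetical path metrics for a first-person maze run; a "path" here is a list
# of moves such as ["forward", "turn_left", "forward"]. Definitions are assumed,
# not taken from the paper.

def complete_path_accuracy(predicted: list[str], reference: list[str]) -> float:
    """1.0 only if the model's entire move sequence matches the reference path."""
    return float(predicted == reference)

def partial_path_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reference moves reproduced before the first mistake."""
    correct = 0
    for pred, ref in zip(predicted, reference):
        if pred != ref:
            break
        correct += 1
    return correct / len(reference) if reference else 0.0

if __name__ == "__main__":
    ref = ["forward", "turn_left", "forward", "forward"]
    pred = ["forward", "turn_left", "turn_right", "forward"]
    print(complete_path_accuracy(pred, ref))  # 0.0
    print(partial_path_accuracy(pred, ref))   # 0.5
```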
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
EgoBlind: Towards Egocentric Visual Assistance for the Blind People
Xiao, Junbin, Huang, Nanxin, Qiu, Hao, Tao, Zhulin, Yang, Xun, Hong, Richang, Wang, Meng, Yao, Angela
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,210 videos that record the daily lives of real blind users from a first-person perspective. It also features 4,927 questions directly posed, or generated and verified, by blind individuals to reflect their needs for visual assistance across various scenarios. We provide each question with an average of 3 reference answers to mitigate the subjectivity of evaluation. Using EgoBlind, we comprehensively evaluate 15 leading MLLMs and find that all models struggle, with the best performers achieving accuracy around 56%, far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and provide heuristic suggestions for improvement. With these efforts, we hope EgoBlind can serve as a valuable foundation for developing more effective AI assistants that enhance the independence of blind individuals.
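Since each EgoBlind question comes with roughly three reference answers, one minimal way to score a model answer is to count it correct if it matches any of the references. The normalized exact-match comparison below is an assumption for illustration; the benchmark may well use a stricter or model-based judge.

```python
# Sketch of multi-reference accuracy scoring; the matching rule is a simple
# normalized string comparison, assumed for illustration only.

def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def is_correct(model_answer: str, references: list[str]) -> bool:
    """Count the answer as correct if it matches any of the reference answers."""
    return any(normalize(model_answer) == normalize(ref) for ref in references)

def accuracy(predictions: list[str], reference_sets: list[list[str]]) -> float:
    hits = sum(is_correct(p, refs) for p, refs in zip(predictions, reference_sets))
    return hits / len(predictions) if predictions else 0.0
```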
- North America > United States (0.04)
- Asia > Singapore (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- (5 more...)
- Education (0.67)
- Health & Medicine > Consumer Health (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Exo-ViHa: A Cross-Platform Exoskeleton System with Visual and Haptic Feedback for Efficient Dexterous Skill Learning
Chao, Xintao, Mu, Shilong, Liu, Yushan, Li, Shoujie, Lyu, Chuqiao, Zhang, Xiao-Ping, Ding, Wenbo
Imitation learning has emerged as a powerful paradigm for robot skill learning. However, traditional data collection systems for dexterous manipulation face challenges, including a lack of balance between acquisition efficiency, consistency, and accuracy. To address these issues, we introduce Exo-ViHa, an innovative 3D-printed exoskeleton system that enables users to collect data from a first-person perspective while receiving real-time haptic feedback. This system combines a 3D-printed modular structure with a SLAM camera, a motion capture glove, and a wrist-mounted camera. Various dexterous hands can be installed at the end, enabling the system to simultaneously collect the posture of the end effector, hand movements, and visual data. By leveraging the first-person perspective and direct interaction, the exoskeleton enhances task realism and haptic feedback, improving the consistency between demonstrations and actual robot deployments. In addition, it has cross-platform compatibility with various robotic arms and dexterous hands. Experiments show that the system can significantly improve the success rate and efficiency of data collection for dexterous manipulation tasks.
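To make the synchronized data stream concrete, here is a hypothetical per-timestep record for a demonstration collected with such a system; the field names, shapes, and retargeting step are illustrative assumptions, not the authors' actual data format.

```python
# Hypothetical layout of one demonstration timestep: end-effector pose from the
# SLAM camera, finger joints from the motion-capture glove, and a wrist-camera
# image. All names and shapes are assumptions for illustration.
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoFrame:
    timestamp: float
    ee_pose: np.ndarray      # (4, 4) homogeneous transform of the end effector
    hand_joints: np.ndarray  # finger joint angles from the glove, e.g. shape (21,)
    wrist_rgb: np.ndarray    # (H, W, 3) image from the wrist-mounted camera

def to_robot_targets(frame: DemoFrame) -> dict:
    """Placeholder retargeting: map one recorded frame to robot commands."""
    return {"arm_pose": frame.ee_pose, "hand_command": frame.hand_joints}
```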
Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective
Hou, Guiyang, Zhang, Wenqi, Shen, Yongliang, Tan, Zeqi, Shen, Sihao, Lu, Weiming
Social intelligence is built upon three foundational pillars: cognitive intelligence, situational intelligence, and behavioral intelligence. As large language models (LLMs) become increasingly integrated into our social lives, understanding, evaluating, and developing their social intelligence is becoming increasingly important. While multiple existing works have investigated the social intelligence of LLMs, (1) most focus on a specific aspect, so the social intelligence of LLMs has yet to be systematically organized and studied; (2) most position LLMs as passive observers from a third-person perspective, such as in Theory of Mind (ToM) tests, whereas an ego-centric, first-person-perspective evaluation aligns better with actual LLM-based agent use scenarios; and (3) behavioral intelligence lacks comprehensive evaluation, especially in critical human-machine interaction scenarios. In light of this, we present EgoSocialArena, a novel framework grounded in the three pillars of social intelligence: cognitive, situational, and behavioral intelligence, aiming to systematically evaluate the social intelligence of LLMs from a first-person perspective. With EgoSocialArena, we have conducted a comprehensive evaluation of eight prominent foundation models; even the most advanced LLMs, such as o1-preview, lag behind human performance by 11.0 points.
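The first-person framing the authors argue for can be illustrated by recasting a classic third-person Theory-of-Mind item so that the evaluated model is itself a participant rather than an observer; the wording below is invented for illustration and is not an EgoSocialArena item.

```python
# Toy contrast between third-person and first-person framings of the same
# false-belief scenario; both strings are invented examples.

third_person_item = (
    "Sally puts her ball in the basket and leaves the room. "
    "While she is away, Anne moves the ball to the box.\n"
    "Question: Where will Sally look for her ball first?"
)

first_person_item = (
    "You put your ball in the basket and leave the room. "
    "While you are away, Anne moves the ball to the box, and you do not see this.\n"
    "Question: When you return, where do you look for your ball first?"
)
```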
- Leisure & Entertainment > Games (1.00)
- Education (0.68)
Diffusing in Someone Else's Shoes: Robotic Perspective Taking with Diffusion
Spisak, Josua, Kerzel, Matthias, Wermter, Stefan
Humanoid robots can benefit from their similarity to the human shape by learning from humans. When humans teach other humans how to perform actions, they often demonstrate the actions, and the learner can try to imitate the demonstration. Being able to mentally transfer a demonstration seen from a third-person perspective into how it should look from a first-person perspective is fundamental to this ability in humans. As this is a challenging task, it is often simplified for robots by creating the demonstration in the first-person perspective. Creating such demonstrations requires more effort but allows for easier imitation. We introduce a novel diffusion model aimed at enabling the robot to learn directly from third-person demonstrations. Our model learns to generate the first-person perspective from the third-person perspective by translating the size and rotations of objects and the environment between the two perspectives. This allows us to combine the benefits of easy-to-produce third-person demonstrations and easy-to-imitate first-person demonstrations. The model can either represent the first-person perspective as an RGB image or calculate the joint values. Our approach significantly outperforms other image-to-image models on this task.
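A minimal sketch of the conditioning idea is given below, assuming a diffusers-style noise scheduler (exposing add_noise and config.num_train_timesteps) and a UNet callable that takes the noisy first-person image concatenated channel-wise with the third-person view plus a timestep; the authors' actual architecture and training details may differ.

```python
# One denoising-diffusion training step for third- to first-person translation.
# `unet` and `scheduler` are placeholder objects as described in the text above.
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, first_person, third_person):
    """first_person, third_person: image batches of shape (B, C, H, W)."""
    noise = torch.randn_like(first_person)
    # Sample a random diffusion timestep for each image in the batch.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (first_person.size(0),), device=first_person.device)
    noisy = scheduler.add_noise(first_person, noise, t)
    # Condition on the third-person view by concatenating along the channel axis.
    model_input = torch.cat([noisy, third_person], dim=1)
    pred_noise = unet(model_input, t)
    return F.mse_loss(pred_noise, noise)
```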
- Europe > Germany > Hamburg (0.04)
- Europe > Switzerland (0.04)
Can Vision-Language Models Think from a First-Person Perspective?
Cheng, Sijie, Guo, Zhicheng, Wu, Jingwen, Fang, Kechen, Li, Peng, Liu, Huaping, Liu, Yang
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, increasing the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
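The single-answer grading setup described above (GPT-4 as an automatic judge for open-ended answers) can be sketched roughly as follows; the prompt wording, the 1-5 scale, and the call_judge hook are assumptions for illustration, not the benchmark's actual implementation.

```python
# Sketch of LLM-as-judge single-answer grading; `call_judge` is any function
# that sends a prompt to the judge model and returns its text reply.
import re

JUDGE_TEMPLATE = """You are grading an answer to an egocentric visual question.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with a single integer score from 1 (wrong) to 5 (fully correct). Score:"""

def grade(question: str, reference: str, answer: str, call_judge) -> int:
    reply = call_judge(JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1
```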
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (3 more...)
Semantic Segmentation for Autonomous Driving: Model Evaluation, Dataset Generation, Perspective Comparison, and Real-Time Capability
Cakir, Senay, Gauß, Marcel, Häppeler, Kai, Ounajjar, Yassine, Heinle, Fabian, Marchthaler, Reiner
Environmental perception is an important aspect within the field of autonomous vehicles that provides crucial information about the driving domain, including identifying clear driving areas and surrounding obstacles. Semantic segmentation is a widely used perception method for self-driving cars that associates each pixel of an image with a predefined class. In this context, several segmentation models are evaluated regarding accuracy and efficiency. Experimental results on the generated dataset confirm that the segmentation model FasterSeg is fast enough to be used in real time on low-power embedded devices in self-driving cars. A simple method is also introduced to generate synthetic training data for the model. Moreover, the accuracy of the first-person perspective and the bird's-eye-view perspective are compared. For a 320×256 input in the first-person perspective, FasterSeg achieves 65.44% mean Intersection over Union (mIoU), and for a 320×256 input from the bird's-eye-view perspective, it achieves 64.08% mIoU. Both perspectives achieve a frame rate of 247.11 Frames per Second (FPS) on the NVIDIA Jetson AGX Xavier. Lastly, the frame rate and accuracy of both perspectives are measured and compared on the target hardware using 16-bit (FP16) and 32-bit (FP32) floating-point arithmetic.
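For reference, the mIoU figures quoted above can be computed from a confusion matrix over predicted and ground-truth label maps, roughly as in the sketch below (class IDs and shapes are generic; this is not the authors' evaluation code).

```python
# Mean Intersection over Union (mIoU) from integer label maps of equal shape.
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (target.ravel(), pred.ravel()), 1)  # rows: ground truth, cols: prediction
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # ignore classes absent from both prediction and ground truth
    return float((intersection[valid] / union[valid]).mean())
```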
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- Asia (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Transportation > Ground > Road (1.00)
- Information Technology > Robotics & Automation (0.90)
Facebook is now developing a human-like artificial intelligence called Ego4D
Facebook announced a research project Thursday that aims to develop an artificial intelligence capable of perceiving the world like a human being. The project, titled Ego4D, aims to train an artificial intelligence (AI) to perceive the world in the first-person by analyzing a constant stream of video from people's lives. This type of data, which Facebook calls "egocentric" data, is designed to help the AI perceive, remember and plan like a human being. "Next-generation AI systems will need to learn from an entirely different kind of data -- videos that show the world from the center of the action, rather than the sidelines," Kristen Grauman, lead AI research scientist at Facebook, said in the announcement. The project aims to improve AI technology's capacity to accomplish human processes by setting five key benchmarks: "episodic memory," in which the AI ties memories to specific locations and times, "forecasting," "social interaction," "hand and object manipulation" and "audio-visual diarization," in which the AI ties auditory experiences to specific locations and times.
Teaching AI to perceive the world through your eyes
AI that understands the world from a first-person point of view could unlock a new era of immersive experiences, as devices like augmented reality (AR) glasses and virtual reality (VR) headsets become as useful in everyday life as smartphones. Imagine your AR device displaying exactly how to hold the sticks during a drum lesson, guiding you through a recipe, helping you find your lost keys, or recalling memories as holograms that come to life in front of you. To build these new technologies, we need to teach AI to understand and interact with the world like we do, from a first-person perspective -- commonly referred to in the research community as egocentric perception. Today's computer vision (CV) systems, however, typically learn from millions of photos and videos that are captured in third-person perspective, where the camera is just a spectator to the action. "Next-generation AI systems will need to learn from an entirely different kind of data -- videos that show the world from the center of the action, rather than the sidelines," says Kristen Grauman, lead research scientist at Facebook.
- Asia > Singapore (0.05)
- South America > Colombia (0.04)
- North America > United States > Pennsylvania (0.04)
- (8 more...)
Facebook: Here comes the AI of the Metaverse
To operate in augmented and virtual reality, Facebook believes artificial intelligence will need to develop an "egocentric perspective." To that end, the company on Thursday announced Ego4D, a data set of 2,792 hours of first-person video, and a set of benchmark tests for neural nets, designed to encourage the development of AI that is savvier about what it's like to move through virtual worlds from a first-person perspective. The project is a collaboration between Facebook Reality Labs and scholars from 13 universities and research labs. The details are laid out in a paper lead-authored by Facebook's Kristen Grauman, "Ego4D: Around the World in 2.8K Hours of Egocentric Video." Grauman is a scientist with the company's Facebook AI Research unit.